Low-cost, High-Performance Translation Retrieval: Dumber is Better
Abstract
In this paper, we compare the relative effects of segment order, segmentation and segment contiguity on the retrieval performance of a translation memory system. We take a selection of both bag-of-words and segment order-sensitive string comparison methods, and run each over both character- and word-segmented data, in combination with a range of local segment contiguity models (in the form of N-grams). Over two distinct datasets, we find that indexing according to simple character bigrams produces a retrieval accuracy superior to any of the tested word N-gram models. Further, in their optimum configuration, bag-of-words methods are shown to be equivalent to segment order-sensitive methods in terms of retrieval accuracy, but much faster. We also provide evidence that our findings are scalable.
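To make the idea of character-bigram, bag-of-words retrieval concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not name the similarity measures tested, so the Dice coefficient is an assumption chosen for illustration, and the function names (`char_ngrams`, `dice`, `retrieve`) and the toy translation memory are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return a bag (multiset) of character n-grams; n=2 gives bigrams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice(a, b):
    """Dice coefficient between two n-gram bags.

    Assumed similarity measure. It ignores the order in which the
    n-grams occur, so it is a bag-of-words style comparison.
    """
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    overlap = sum((a & b).values())  # multiset intersection
    return 2.0 * overlap / total


def retrieve(query, memory, n=2):
    """Return the (source, translation) pair whose source best matches query."""
    q = char_ngrams(query, n)
    return max(memory, key=lambda pair: dice(q, char_ngrams(pair[0], n)))


# Hypothetical toy translation memory: (source segment, stored translation).
memory = [
    ("press the red button", "appuyez sur le bouton rouge"),
    ("open the front cover", "ouvrez le capot avant"),
    ("replace the toner cartridge", "remplacez la cartouche de toner"),
]

best = retrieve("press the green button", memory)
```

Because character bigrams need no word segmenter, the same code applies unchanged to unsegmented text; the only local contiguity it models is the two-character window.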
Similar resources
A Low Cost Machine Translation Method for Cross-Lingual Information Retrieval
In one form or another, language translation is a necessary part of cross-lingual information retrieval systems. Often this is accomplished using machine translation systems. However, machine translation systems offer low quality for their high costs. This paper proposes a machine translation method that is low cost while improving translation quality. This is done by utilizing multiple we...
Transitive probabilistic CLIR models
Transitive translation could be a useful technique to enlarge the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectivenes...
Hybrid Approach of Query and Document Translation with Pivot Language for Cross-Language Information Retrieval
This paper reports experimental results of cross-language information retrieval (CLIR) from German to French, in which a hybrid approach of query and document translation was attempted, i.e., combining results of query translation (German to French) and of document translation (French to German). In order to avoid excessive computational complexity when translating a large amount of text in docu...
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application which needs translation functionality of a relatively low level of sophistication since current models for information retrieval (IR) are still based on a bag-of-words. The Web provides a vast...
A new shape retrieval method using the Group delay of the Fourier descriptors
In this paper, we introduce a new way to analyze shape using a new Fourier-based descriptor, which is the smoothed derivative of the phase of the Fourier descriptors. It is extracted from the complex boundary of the shape, and is called the smoothed group delay (SGD). Using SGD on the Fourier phase descriptors allows a compact representation of the shape boundaries which is robust ...